Extracting the Main Content of Web Documents Based on Character Encoding and a Naive Smoothing Method
Abstract
This chapter presents R2L, DANA and DANAg, a family of novel algorithms for extracting the main content (MC) of web documents. The main concept behind R2L, which also provided the initial idea and motivation for the other two algorithms, is to exploit particularities of Right-to-Left languages to obtain the MC of web pages. Because the English character set and the Right-to-Left character set are encoded in different intervals of the Unicode character set, Right-to-Left characters can be efficiently distinguished from English ones in an HTML file. The R2L approach then extracts the areas of the HTML file that have a high density of Right-to-Left characters and a low density of characters from the English character set. Having recognized these areas, R2L returns only the Right-to-Left characters as its result. The first extension, DANA, improves the effectiveness of the baseline algorithm by employing an HTML parser in a post-processing phase of R2L to extract the MC from areas with a high density of Right-to-Left characters. DANAg, the second extension, generalizes the idea of R2L to make it language independent.
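The core idea of the abstract, separating Right-to-Left characters from English ones by their Unicode intervals and keeping only the dense Right-to-Left regions, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The code-point ranges (Hebrew and Arabic blocks), the use of one text line as an "area", the `min_ratio` threshold, and the function names are all assumptions made for illustration. The smoothing step named in the title is omitted because the abstract does not describe it.

```python
# Illustrative sketch only: the ranges, the line-level granularity and the
# threshold are assumptions, not details taken from the chapter.

RTL_START = 0x0590  # start of the Hebrew block (assumed lower bound)
RTL_END = 0x06FF    # end of the Arabic block, which also covers Persian letters


def is_rtl(ch: str) -> bool:
    """Return True if the character lies in the assumed Right-to-Left interval."""
    return RTL_START <= ord(ch) <= RTL_END


def count_chars(line: str) -> tuple[int, int]:
    """Count Right-to-Left characters and non-whitespace ASCII characters in a line."""
    rtl = sum(1 for ch in line if is_rtl(ch))
    ascii_count = sum(1 for ch in line if ord(ch) < 0x80 and not ch.isspace())
    return rtl, ascii_count


def extract_rtl_content(html: str, min_ratio: float = 1.0) -> list[str]:
    """Keep lines dominated by Right-to-Left characters and strip everything else.

    A line is kept when it contains at least one Right-to-Left character and
    at least min_ratio times as many of them as ASCII characters
    (a hypothetical density criterion).
    """
    kept = []
    for line in html.splitlines():
        rtl, ascii_count = count_chars(line)
        if rtl > 0 and rtl >= min_ratio * ascii_count:
            # Replace every non-RTL character with a space, then collapse runs
            # of whitespace so that only the RTL words remain.
            only_rtl = "".join(ch if is_rtl(ch) else " " for ch in line)
            kept.append(" ".join(only_rtl.split()))
    return kept
```

Under these assumptions, a navigation line such as `<div class="menu"><a href="/x">خانه</a></div>` is dropped because its markup outweighs its four Persian letters, while a paragraph line consisting mostly of Persian text is kept and stripped of its tags.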
Similar resources
Presentation of Competency Model Needed by Elementary Education Graduates of Farhangian University based on the theory Deliberative Inquiry
Purpose: The purpose of the present study was to present a competency model needed by elementary education graduates of Farhangian University based on the theory of deliberative inquiry. The research method was qualitative (phenomenology and content analysis), and the statistical population comprised faculty members in the field of curriculum and all scientific sources and documents. Method: The semi...
Presenting a method for extracting structured domain-dependent information from Farsi Web pages
Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...
Web pages ranking algorithm based on reinforcement learning and user feedback
The main challenge of a search engine is ranking web documents to provide the best response to a user's query. Despite the huge number of results extracted for a user's query, only a small number of the first results are examined by users; therefore, placing the relevant results in the first ranks is of great importance. In this paper, a ranking algorithm based on the reinforcement le...
Classification of Web Documents Using a Naive Bayes Method
This paper presents an automatic document classification system, WebDoc, which classifies Web documents according to the Library of Congress classification scheme. WebDoc constructs a knowledge base from the training data and then classifies the documents based on information in the knowledge base. One of the classification algorithms used in WebDoc is based on Bayes’ theorem from probability t...
A New Model for Phrase Search Based on Weighted Minimum Displacement
Finding high-quality web pages is one of the most important tasks of search engines. The relevance between the documents found and the query searched depends on the user observation and increases the complexity of ranking algorithms. The other issue is that users often explore just the first 10 to 20 results while millions of pages related to a query may exist. So search engines have to use sui...
Publication date: 2011